DOCUMENT RESUME

ED 281 864                                              TM 870 248

AUTHOR        Stocking, Martha L.; Eignor, Daniel R.
TITLE         The Impact of Different Ability Distributions on IRT
              Preequating.
INSTITUTION   Educational Testing Service, Princeton, N.J.
REPORT NO     ETS-RR-86-49
PUB DATE      Dec 86
NOTE          91p.
PUB TYPE      Reports - Research/Technical (143)

EDRS PRICE    MF01/PC04 Plus Postage.
DESCRIPTORS   College Entrance Examinations; Computer Simulation;
              Equated Scores; Error of Measurement; *Estimation
              (Mathematics); *Item Analysis; Latent Trait Theory;
              Mathematics Tests; *Multidimensional Scaling; Test
              Items
IDENTIFIERS   *Item Parameters; LOGIST Computer Program; *Pre
              Equating (Tests); Scholastic Aptitude Test; Three
              Parameter Model



ABSTRACT 

In item response theory (IRT), preequating depends upon item
parameter estimate invariance. Three separate simulations, all using
the unidimensional three-parameter logistic item response model, were
conducted to study the impact of the following variables on
preequating: (1) mean differences in ability; (2) multidimensionality
in the data; and (3) a combination of mean differences in ability and
multidimensionality. One of the Scholastic Aptitude Test mathematical
forms (3ASA3) which provided the least acceptable preequating was
selected to define true item and person parameters for these
simulations. A random sample of 2,744 examinees was used. The LOGIST
computer program was chosen to estimate item parameters for the 60
items in 3ASA3 and the 24 items in the equating section fn. Results
showed that differences in mean true ability can cause differences in
the precision with which a particular estimation procedure estimates
parameters. The introduction of a particular kind of
multidimensionality in the data can have a large impact on estimation
precision when the IRT model is unidimensional. The combination of a
slight decrease in mean ability and a particular type of
multidimensionality in the data also has a large impact on estimation
precision when the IRT model is unidimensional, although the impact is
lessened somewhat. (JAZ)



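***********************************************************************
*    Reproductions supplied by EDRS are the best that can be made     *
*                     from the original document.                     *
***********************************************************************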






RR-86-49 






THE IMPACT OF DIFFERENT ABILITY DISTRIBUTIONS ON 

IRT PREEQUATING 



Martha L. Stocking 
and 

Daniel R. Eignor 






"PERMISSION TO REPRODUCE THIS 
MATERIAL HAS BEEN GRANTED BY



TO THE EDUCATIONAL RESOURCES 
INFORMATION CENTER (ERIC)." 







U.S. DEPARTMENT OF EDUCATION
Office of Educational Research and Improvement
EDUCATIONAL RESOURCES INFORMATION
CENTER (ERIC)

This document has been reproduced as
received from the person or organization
originating it.

Minor changes have been made to improve
reproduction quality.

Points of view or opinions stated in this document
do not necessarily represent official
OERI position or policy.



Educational Testing Service
Princeton, New Jersey 
December 1986 



The Impact of Different Ability Distributions on IRT Preequating¹,²,³

Martha L. Stocking
and 

Daniel R. Eignor 
Educational Testing Service 



October 1986 



¹An earlier version of this paper was presented at the annual
meeting of AERA, San Francisco, 1986.

²This study was supported by Educational Testing Service through
Program Research Planning Council funding.

³The authors would like to acknowledge the advice of Marilyn
Wingersky and the assistance of Nancy Wright in performing this study.






Copyright © 1986. Educational Testing Service. All rights reserved.






ABSTRACT 



Item response theory preequating depends upon item parameter
estimate invariance. The impact of differences in true ability on
the invariance properties of item parameter estimates was studied
with simulated data. Using real SAT-mathematical data that had
produced unsatisfactory preequating results to suggest hypotheses,
three explanatory models were investigated: 1) differences in
mean true ability, 2) a certain type of multidimensionality, and
3) a combination of differences in mean true ability and
multidimensionality. This latter model produced results consistent
with the real data.



IRT Preequating 
1 

The Impact of Different Ability Distributions on IRT Preequating

Martha L. Stocking
and

Daniel R. Eignor
INTRODUCTION 

In item response theory (IRT), when model assumptions are
satisfied, true item parameters do not change even when considered
across samples with different true abilities from the same
population. Likewise, true abilities do not change, even when
considered in reference to different sets of items (Lord, 1980).
This is called the 'invariance' property of the true item and
person parameters.

The invariance property of true item parameters suggests that 
it is possible to equate a test before it is actually administered, 
as long as true item parameters are known. This is called
'preequating'. The invariance property of true abilities suggests
that adaptive testing, where individuals take different sets of 
items, is possible. 

How well either of these two novel ideas works in practice
depends not upon the true item parameters or person parameters, but
rather, on ESTIMATES of them. To the extent that estimates fail to
approximate truth, both preequating and adaptive testing will fail.
While there may be many specific reasons why estimates do not
approximate truth very well, the reasons can generally be
classified into two broad categories: reasons having to do with the
imprecision introduced by various estimation procedures currently in
use; and reasons having to do with the failure of the data to
satisfy the underlying assumption(s) of the particular IRT model
used.

Two recent studies of preequating SAT verbal and mathematical
data using the three-parameter logistic (3PL) item response model
showed disappointing results in the face of reasonable evaluative
criteria (Eignor, 1985; Eignor & Stocking, 1986). These large scale
studies showed that only one of two verbal preequatings was adequate,
and that neither of two mathematical preequatings was adequate.
Explorations of many reasonable explanatory hypotheses were
conducted, but no definitive answers were found. It was suggested
that differences in abilities across samples might somehow cause the
results found, namely that the tests under study had higher raw score
to scale conversions, i.e., appeared to be more difficult, when
preequated than when equated using intact final form data from a
regular administration. This hypothesis was further strengthened by
two observations:

1) Items tended to be harder in pretest form than in intact
   operational administrations, i.e., the b's were higher.

2) Pretest samples tended to have lower abilities than intact
   form administration samples, as measured by the scaled
   score means the pretest samples attained on the intact
   forms accompanying the pretests.

Sample differences could cause the results found in the
preequating studies either because such differences introduce
different estimation errors, or because they in fact represent a
violation of model assumptions, or both.

This current study attempts to simplify the study of
preequating by using simulated data. Because the data are
simulated, one can more easily study the effects of sample
variation on preequating. Three separate simulations, all using the
unidimensional three-parameter logistic item response model, were
conducted. These simulations were designed to study the following
variables:

1) Mean differences in ability.

   Samples of data that vary only by a shift in the average
   true ability were simulated. While different estimation
   errors do impact preequating, the results of this
   simulation did not explain the previous real-data
   results.

2) Multidimensionality in the data.

   A certain type of multidimensionality was introduced into
   different simulated samples. The data were analyzed with
   a unidimensional item response model, thus violating
   model assumptions. The effects on preequating were
   partially consistent with results from real data,
   although much larger.

3) Mean differences in ability and multidimensionality in the
   data.

   This simulation combined the two types of sample
   variations studied above. The effects on preequating
   were more moderate than those found in the second study,
   although larger than those seen with real data. These
   results, however, were completely consistent with the
   real data results.
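
All three simulations rest on the unidimensional three-parameter
logistic item response function. A minimal sketch of that function
follows. The name p_3pl, the scaling constant D = 1.7, and the item
parameter values are assumptions made for illustration; the report
does not state them at this point.

```python
import math

D = 1.7  # assumed scaling constant; not stated in the text


def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model.

    theta: examinee ability; a: discrimination; b: difficulty;
    c: lower asymptote (pseudo-guessing level).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


# Hypothetical item: the same (a, b, c) defines the same curve no matter
# which sample of abilities is evaluated; only theta changes.
item = dict(a=1.0, b=0.0, c=0.2)
for theta in (-1.05, -0.35, 0.0, 1.0):
    print(theta, round(p_3pl(theta, **item), 3))
```

When theta equals b, the probability is halfway between c and 1, which
is a convenient sanity check on any implementation.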

METHODOLOGY

The Definition of Truth

One of the two SAT mathematical forms from the previous studies
was selected to define true item and person parameters for these
simulations. The form chosen, 3ASA3, provided the least acceptable
mathematical preequating. Using a random sample of 2744 examinees
from the operational administration of this form with equating
section fn, item parameters for the 60 items in 3ASA3 and the 24
items in fn were estimated using LOGIST (Wingersky, Barton, & Lord,
1982). These item and person parameter estimates were then used as
realistic true item and person parameters. Table 1 gives summary
statistics for these true parameters.




Insert Table 1 about here 

When 3ASA3 was first administered as an intact test form, it
was equated to the familiar College Board scale for score reporting
purposes. The particular equating chosen at that time was a linear
one. For purposes of these simulations, this linear equating will
be, by definition, the 'true' equating associated with the true item
and person parameters.

Using the frequency distribution of observed scores for all
individuals who took 3ASA3 at this first administration, a 'true'
scaled score mean of 485, and a 'true' scaled score standard
deviation of 113 were computed. Comparisons among simulated
equatings will frequently be made in reference to these 'true'
values.
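
The scaled score mean and standard deviation described above are the
moments of a raw-score frequency distribution after it is passed
through the linear conversion. The sketch below shows that
computation. The frequency table, slope, and intercept are invented
for illustration and are not the 3ASA3 values, and the function name
scaled_moments is hypothetical.

```python
import math


def scaled_moments(freq, slope, intercept):
    """Mean and SD of scaled scores for a raw-score frequency table.

    freq maps raw score -> number of examinees; the conversion is the
    linear equating scaled = slope * raw + intercept.
    """
    n = sum(freq.values())
    mean = sum(f * (slope * x + intercept) for x, f in freq.items()) / n
    var = sum(f * (slope * x + intercept - mean) ** 2
              for x, f in freq.items()) / n
    return mean, math.sqrt(var)


# Hypothetical frequency table and conversion line (not the real data).
freq = {10: 200, 20: 500, 30: 800, 40: 400, 50: 100}
mean, sd = scaled_moments(freq, slope=9.0, intercept=220.0)
print(round(mean, 1), round(sd, 1))
```

Holding the frequency distribution fixed and swapping in a different
conversion is what allows equatings to be compared on a common
footing.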

First Simulation: Mean Differences in Ability

The first simulation was designed to explore the hypothesis
that preequatings that produce higher scaled score means (meaning
that the preequated test appears to be more difficult) result from
less able preequating samples. While this idea is plausible, it
challenges the efforts to produce item parameter estimates that
exhibit the invariance property of true item parameters.

The simulation was designed to mirror, as much as possible,
differences observed in the summary statistics for real data. For
3ASA3, 13 out of the 14 samples on which items were pretested had
lower scaled score means on the intact forms administered with the
pretests than the sample taking 3ASA3 when given as an intact form.
The lowest scaled score mean on all intact test forms given with
3ASA3 items being pretested was 441. The scaled score mean for
3ASA3 when given in intact form was 485. Using results from a
typical IRT equating, this 44 scaled score point difference
translates into a difference of about .35 on the IRT ability metric.
Sample scaled score standard deviations varied only from 110 to 117
in the previous studies. Therefore, no attempt was made here to
simulate differences in variances among the simulated samples.
Simulated Samples 

Using the true abilities for 3ASA3 and fn, four different
distributions of true abilities were independently generated with
progressively lower true ability means (0, -.35, -.70, -1.05).
These particular levels were chosen for two reasons: 1) the
difference between the first two (.35) matches the largest mean
decrease found in the real data, and 2) it was hoped that further
decreases would result in exaggerated and therefore easily
detectable effects resembling real-data results. Samples of N =
2500 simulees were then drawn from each distribution. The bottom
portion of Table 2 presents the results of this sample selection.
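
One way to picture the sample construction is to shift a base set of
true abilities by each target mean and then draw N = 2500 simulees
from each shifted distribution. In the sketch below the base abilities
are drawn from a standard normal distribution as a stand-in for the
2744 values that define truth, and drawing with replacement is an
assumption; the text does not say how the draws were made.

```python
import random
import statistics

random.seed(1986)  # fixed seed so the sketch is reproducible

# Stand-in for the 2744 'true' abilities (assumed standard normal here).
base = [random.gauss(0.0, 1.0) for _ in range(2744)]
base_mean = statistics.fmean(base)
centered = [t - base_mean for t in base]  # force the base mean to 0

target_means = [0.0, -0.35, -0.70, -1.05]
N = 2500

samples = []
for shift in target_means:
    shifted = [t + shift for t in centered]       # same shape, lower mean
    samples.append(random.choices(shifted, k=N))  # draw with replacement

for shift, s in zip(target_means, samples):
    print(shift, round(statistics.fmean(s), 2), round(statistics.stdev(s), 2))
```

Because only a constant is added, the four distributions differ in
location but not in spread, which matches the decision not to simulate
differences in variances.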

Insert Table 2 about here 




Responses to each item in 3ASA3 and fn were then generated
using the true abilities for each sample and the true item
parameters for all 84 items.
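
Response generation under the 3PL model can be sketched as one
Bernoulli draw per simulee-item pair. The item parameters below are
randomly invented placeholders for the 84 true items, D = 1.7 is an
assumed scaling constant, and simulate_responses is a hypothetical
helper name.

```python
import math
import random

random.seed(7)
D = 1.7  # assumed scaling constant


def p_3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def simulate_responses(thetas, items):
    """Return a 0/1 matrix: one row per simulee, one column per item."""
    return [[1 if random.random() < p_3pl(t, *item) else 0 for item in items]
            for t in thetas]


# Placeholder parameters for 84 items as (a, b, c) triples.
items = [(random.uniform(0.5, 1.5), random.gauss(0.0, 1.0), 0.2)
         for _ in range(84)]
thetas = [random.gauss(-0.35, 1.0) for _ in range(2500)]
X = simulate_responses(thetas, items)
print(len(X), len(X[0]), round(sum(map(sum, X)) / (len(X) * len(X[0])), 3))
```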
Estimation of Item Parameters and Abilities

LOGIST (Wingersky, Barton, & Lord, 1982) was used to calibrate
all items and abilities in a single concurrent execution, with
equating items fn used as an anchor to set the scale. This method,
described in detail in Petersen, Cook, and Stocking (1983), has
provided satisfactory parameter scaling results in a number of
studies. The N = 10000 (four samples of 2500) and n = 264 (24 fn
items plus four sets of 60 3ASA3 items) data matrix can be
represented as follows:



                               Items

People       fn    3ASA3-1    3ASA3-2    3ASA3-3    3ASA3-4

Sample 1     x        x
Sample 2     x                   x
Sample 3     x                              x
Sample 4     x                                         x

In this matrix, an x indicates that a group of items is taken by a
particular group of examinees; a blank indicates that group of items
is not administered to a group of examinees. The above design
produces four different sets of item parameter estimates for the
3ASA3 items. Each set of estimates differs only in the mean ability
level of the group used for estimation. This mirrors a calibration
of 'pretest' items (taken by Samples 2, 3, and 4) and,
simultaneously, a calibration of 'operational' items taken by
Sample 1.
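
The block structure of this design can be written down directly. The
sketch below only encodes which cells of the 10000 by 264 matrix hold
responses; the constant and function names are hypothetical, and the
column ordering (fn first, then the four copies of 3ASA3) is an
assumption taken from the layout of the matrix above.

```python
N_PER_SAMPLE = 2500
N_ANCHOR = 24      # equating section fn
N_FORM = 60        # items in each copy of 3ASA3
N_SAMPLES = 4

n_items = N_ANCHOR + N_SAMPLES * N_FORM      # 264 columns
n_people = N_SAMPLES * N_PER_SAMPLE          # 10000 rows


def administered_columns(sample_index):
    """Columns taken by a sample (0-based): fn plus that sample's copy."""
    start = N_ANCHOR + sample_index * N_FORM
    return list(range(N_ANCHOR)) + list(range(start, start + N_FORM))


def is_administered(person, column):
    """True if the cell holds a response, False if it is blank."""
    sample_index = person // N_PER_SAMPLE
    if column < N_ANCHOR:
        return True  # every sample takes the anchor items
    return (column - N_ANCHOR) // N_FORM == sample_index


print(n_people, n_items, len(administered_columns(2)))
```

Each simulee therefore contributes 84 responses, and the 24 anchor
columns are the only ones shared by all four samples.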

Scaling of Estimates 

LOGIST establishes the metric upon which parameter estimates are
reported by setting the mean and standard deviation of a truncated
distribution of ability estimates to zero and one, respectively. The
true item and ability parameters are on a different scale.
Therefore, before any comparisons can be made between estimated and
true parameters, a scaling transformation is required.

Sample 1 comes from the original distribution of true
abilities. If Sample 1 estimated abilities differed from the true
abilities by only a scaling factor, one could use the relationship
between these estimated abilities and true abilities to determine
the appropriate scaling transformation. Since the estimates contain
errors, one can approximate the scaling transformation by
determining for Sample 1 the transformation necessary to make robust
measures of location and scale of the estimated abilities equal to
robust measures of location and scale of the true abilities. This
linear transformation can then be applied to all estimates in the
LOGIST run to place them on the same scale as the true values.
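
The scaling step can be sketched as follows. The text does not name
the robust measures that were used, so the median and interquartile
range here are assumptions, as are all function names. The rule for
carrying a linear change of the ability metric over to 3PL item
parameters (a is divided by the slope, b is transformed like an
ability, c is unchanged) is the standard one.

```python
import statistics


def robust_location_scale(values):
    """Median and interquartile range (one possible robust pair)."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return q2, q3 - q1


def scaling_constants(estimated, true):
    """Slope A and intercept B so that A * estimated + B matches true."""
    loc_e, scale_e = robust_location_scale(estimated)
    loc_t, scale_t = robust_location_scale(true)
    A = scale_t / scale_e
    B = loc_t - A * loc_e
    return A, B


def rescale_item(a, b, c, A, B):
    """Put 3PL item parameter estimates on the target metric."""
    return a / A, A * b + B, c


# Toy check: estimates that differ from truth only by a linear change.
true = [-2.0, -1.0, -0.5, 0.0, 0.3, 0.8, 1.5, 2.2]
estimated = [(t - 0.1) / 1.2 for t in true]
A, B = scaling_constants(estimated, true)
print(round(A, 3), round(B, 3))
```

In the toy check the constants are recovered exactly because the
estimates contain no error; with real estimates the robust measures
limit the influence of poorly estimated extreme abilities.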




Comparison of Estimated and True Parameters

Summary statistics for the estimates of item and person
parameters after the scaling transformation are presented in Table
2. In this table, it can be seen that the mean true ability as well
as the mean estimated ability decrease across the four samples, a
consequence of the study design. It is important to remember,
however, that during the calibration process, parameters for all
four samples were estimated simultaneously. LOGIST standardizes its
results using the mean and standard deviation of all estimated
abilities. This mean will lie somewhere between the means for
Samples 2 and 3. Therefore, Samples 2 and 3 are closer to the
overall mean true ability during the calibration, and Samples 1 and
4 lie further away.

Simple "box and whisker" plots that graphically show the
relationships among the distributions of estimated abilities are
given in the top part of Figure 1. The horizontal axis in this
figure represents ability. The left and right asterisks mark the
10th and 90th percentiles of the distribution. The left and
right sides of the box mark the 25th and 75th percentiles. The
vertical bar in the box interior marks the 50th percentile.
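
The five plotting positions used in these plots can be computed as
below. The abilities are simulated stand-ins, the helper name
five_point_summary is hypothetical, and the interpolation rule is
whatever Python's statistics.quantiles uses by default, which may
differ slightly from the rule used for the report's figures.

```python
import random
import statistics

random.seed(3)


def five_point_summary(values):
    """10th, 25th, 50th, 75th, and 90th percentiles of a sample."""
    cuts = statistics.quantiles(values, n=20)  # cut points at 5%, 10%, ...
    # cuts[k] is the (k + 1) * 5th percentile.
    return {p: cuts[p // 5 - 1] for p in (10, 25, 50, 75, 90)}


abilities = [random.gauss(0.0, 1.0) for _ in range(2500)]  # stand-in data
summary = five_point_summary(abilities)
print({p: round(v, 2) for p, v in summary.items()})
```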

Insert Figure 1 about here 




Figures 2 through 5 compare estimated item parameters and
estimated abilities for test 'forms' 3ASA3-1, 3ASA3-2, 3ASA3-3, and
3ASA3-4 with the true values. The different symbols on a single
plot indicate the behavior of item parameters shown in other plots.
Examination of these figures leads to the following observations:

Insert Figures 2, 3, 4, and 5 about here 

1) Estimates of item discriminations using Sample 1 data are
   generally too low. Sample 1 was the most able sample.
   Estimates of item discriminations from Sample 2 and 3 data
   are reasonably good. Estimates of item discriminations
   from Sample 4 data are generally too high. Sample 4 was
   the least able sample.

2) The item difficulties for the samples closer to the overall
   mean true ability (2 and 3) are slightly better estimated
   (have less scatter) than item difficulties from the more
   extreme samples (1 and 4).

3) The difficulties for easy and hard items are less well
   estimated than those for less extreme items, regardless of
   the sample used. The most able sample (Sample 1) has more
   overestimated hard items. The least able sample (Sample 4)
   has more overestimated easy items.

4) The less able the sample, the better the estimates of c
   become.

5) Low and high abilities tend to be overestimated. Because
   Sample 1 is the most able sample, it has the greatest
   number of overestimated high abilities. Because Sample 4
   is the least able sample, it has the greatest number of
   overestimated low abilities. The two middle samples have
   fewer overestimated abilities than the two extreme samples
   because they are closer to the overall mean true ability.

The fact that the estimation procedure does not recover the
true parameter values is not surprising. Any estimation procedure
is imperfect. But it is important to understand why the procedure
is systematically imperfect, because this will explain how
estimation errors impact subsequent equatings. In this case, the
explanation proceeds as follows:

1) It has been observed that extreme abilities (either high
   or low) can be overestimated when the item parameters are
   not known (Lord, 1975, p. 16). While an explanation for
   this phenomenon is currently under development, it is
   important to note that in other simulation studies,
   different estimation errors have sometimes been noted.
   The overestimation of high abilities is almost always
   observed. However, low abilities are sometimes observed
   to be underestimated (Wingersky, 1985) as well as
   overestimated.

   In addition, Lord (1975) shows that the
   overestimation can be greater for low abilities than for
   high abilities. Examination of Figures 1 through 4
   shows that this is the case here also. The extreme
   samples, 1 and 4, contain more overestimated abilities
   than the two middle samples. In addition, the degree of
   overestimation for the low abilities in Sample 4, the
   least able sample, is greater than for the high abilities
   in Sample 1, the most able sample.

2) In Sample 4, difficulty parameter estimates for easy items
   tend to be more overestimated than parameters estimated for
   difficult items because lower abilities are more
   overestimated than high ones. In Sample 1, hard items
   tend to be more overestimated than easy ones because of
   the overestimation of high abilities. This is so because
   the overestimated abilities give erroneous information
   about item location.

3) Wingersky and Lord (1984) show that there is a positive
   sampling correlation between estimates of a and b when
   the item is easy, and a negative sampling correlation when
   the item is difficult. For Sample 4, the least able
   sample, all but one of the estimated a's are too high for
   easy items ( b < -1.0 ). For Sample 1, the most able
   sample, all but one of the estimated a's are too low for
   hard items ( b > +1.6 ).

To summarize: estimation errors found for extreme abilities
are reflected in estimation errors for item difficulties. Because
of the sampling correlations between item difficulty estimates and
discrimination estimates, predictable estimation errors then occur
for the item discriminations.

Equating Results 

Of primary importance in this study is the analysis of
equating results when item sets have been calibrated on samples of
different ability. Figure 6 shows the results of IRT equatings of
forms 3ASA3-2, 3ASA3-3, and 3ASA3-4 to form 3ASA3-1. The figure
displays both the equating and equating residuals plots. The
linear criterion equating is the 'true' equating of the intact form
3ASA3 to the 200 to 800 score metric.

Referring to the residual plots, it may be seen that for small
differences in mean true ability for the calibration group, the
impact on equating is really quite small, less than 5 scaled score
points at all levels of raw scores. For the largest difference in
mean true ability, the impact is greater. For higher raw scores,
it can be as much as 15 scaled score points.




Insert Figure 6 about here

Using the frequency distribution of scaled scores obtained
when 'true' form 3ASA3 was operationally equated, the top part of
Table 3 summarizes the equating results in terms of the scaled
score means and standard deviations. From these numbers, it may
be seen that small sample differences cause about a one-point
difference in scaled score means. The largest sample difference,
from the least able sample, is about 5 scaled score points.

Insert Table 3 about here 

How do these equating results compare with the real-data
preequating results? The differences are striking:

1) Differences in mean true ability of 1/3 to 2/3 of a
   standard deviation have only a very slight impact on
   equating. The magnitude of the differences for
   real-data results was even larger than the differences
   seen for the least able sample. This sample had mean true
   ability about one standard deviation below the most able
   sample. However, the real data contained no sample
   differences this large. Hence, this simulation cannot
   explain the real-data results.

2) The direction of the equating differences is exactly the
   opposite in the simulation from that found in real data.
   Here, we find that 3ASA3-4, calibrated on the least able
   sample, appears easier than it should. In real data, the
   preequating indicated a harder test, not an easier one.

We can explain the equating differences found for our
simulated data at least partially in terms of the item parameter
estimation errors previously described. It is clear that the
differences in item parameter estimation errors for Samples 1, 2,
and 3 have only a small impact on equating results. The impact on
equating begins to become important only for the least able sample,
Sample 4.

Figure 7 compares the item parameter estimates from the least
able sample with those from the most able sample, Sample 1. In
these plots, different plotting symbols in one plot indicate the
behavior of the parameter estimates in another plot. Estimates of
the a's for Sample 4 are higher than estimates for Sample
1. This is true since the a's were underestimated from Sample 1,
and overestimated from Sample 4. The mean estimated a for Sample 4
is 1.05, while that for Sample 1 is .95. The estimates of item
difficulty are not that different; the Sample 4 mean is -.01, while
the Sample 1 mean is +.01.






Insert Figure 7 about here 



The resulting impact on equating is most easily seen in Figure
8, which plots the test characteristic curves for all four
simulated forms. Because of the overestimation of the a's in the
least able sample, the test characteristic curve for 3ASA3-4 is
shifted to the left of the others. For any value of true ability
above .5, the number right true score will be higher on this form
than on the other forms. Hence, form 3ASA3-4 appears easier for
individuals of moderately high true ability than the other forms.
There is little difference among the test characteristic curves at
middle and low ability levels.
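
A test characteristic curve is the sum of the item response functions
of the items on a form, so it gives the expected number right true
score at each ability. The sketch below evaluates it for a
hypothetical five-item form and for a copy with every discrimination
inflated by 10 percent. The item values and D = 1.7 are assumptions
for illustration, and the sketch is not meant to reproduce Figure 8.

```python
import math

D = 1.7  # assumed scaling constant


def p_3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def tcc(theta, items):
    """Expected number-right true score at ability theta."""
    return sum(p_3pl(theta, a, b, c) for a, b, c in items)


# Hypothetical 5-item form and a copy with every a inflated by 10%.
form = [(0.8, -1.0, 0.2), (1.0, -0.3, 0.2), (1.1, 0.2, 0.15),
        (0.9, 0.9, 0.2), (1.2, 1.5, 0.1)]
inflated = [(1.1 * a, b, c) for a, b, c in form]

for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(theta, round(tcc(theta, form), 2), round(tcc(theta, inflated), 2))
```

For an ability above every item's difficulty, a larger a raises each
item's probability and so raises the curve, which is the direction of
the effect described for high abilities.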



Insert Figure 8 about here 



Thus, one can explain, at least partially, the differences
found in the simulated equatings through differences in parameter
estimation errors caused by different samples of true ability.
Unfortunately, this does not illuminate the real-data results from
the previous studies.

How Big Is Bad?

A separate aspect of equating differences can be explored
using data from this simulation. It was previously observed that
scaled score mean differences of up to 5 points can result from
different samples. Scaled score mean differences in the real-data
study were up to 13 points. While smaller differences are better,
how can one understand the importance of these differences?

One method of evaluating differences is to compare equatings
where one set of item parameters is estimated and the other set of
item parameters is considered to be the truth. Figure 9 shows
equating results when forms 3ASA3-1, 3ASA3-2, 3ASA3-3, and 3ASA3-4
are equated to 'true' test form 3ASA3. The differences are quite
large when compared to the corresponding equatings when item
parameters for both forms are estimated. The differences seen for
form 3ASA3-1 indicate the magnitude that can be expected on the
basis of what is predominantly estimation error, since this form
was taken by Sample 1, whose mean true ability was the same as the
definition of truth. Other equating differences result from a
combination of estimation error alone and estimation error due to
differences in abilities.

Insert Figure 9 about here 

The results are summarized in terms of scaled score means and
standard deviations in the middle portion of Table 3. As a result
of only estimation errors, a difference in mean scaled scores of
about 2 scaled score points is observed. Equating errors from
differences in estimation errors resulting from differences in true
ability can be higher, about 3 scaled score points.

It is interesting to note that, although Sample 1 and Sample 4
are about equally as far from the overall true mean ability in the
calibration, the type of estimation errors made for these two
outlying samples has a very different impact on equating errors.
Samples 1, 2, and 3 have about a 2- to 3-point increase in mean
scaled score over true mean scaled score; Sample 4 has about a
3-point decrease in mean scaled score over true mean scaled score.

Conclusions from the First Simulation

Differences in mean true abilities can cause differences in
equatings. For small differences in mean true abilities, these
equatings differ by about what one would expect on the basis of
estimation errors alone. For a large difference in true ability,
the difference in equated means is about twice that. These
equating differences are at least partially explainable on the
basis of the known magnitude and direction of estimation errors
when samples differ in mean true ability. However, the direction
of equating errors is opposite to that found in the previous
studies with real data.

Second Simulation: Multidimensionality in the Data

From the results of the first simulation, it is clear that the
explanation of poor preequating results found with real data does
not lie solely with the imprecision of the estimation procedure.
This second study was designed to explore the other category of
potential problems in parameter estimation: the failure of the
data to satisfy the underlying assumption(s) of the particular IRT
model used.

The Eignor and Stocking (1986) results were reexamined, this
time in terms of the abilities estimated by LOGIST for every sample
of examinees that contributed to the calibration of pretest items.
Table 4 shows the summary statistics for the real data used to
preequate test form 3ASA3. Each sample is labeled, and the number
of pretest items contributed by this sample is in parentheses by
the sample designation. Samples are listed in decreasing order by
median estimated ability. Percentile information is also displayed
graphically in "box and whisker" plots in Figure 10.

Insert Table 4 and Figure 10 about here 

The use of these distributions to make inferences about
distributions of true abilities is not strictly correct, since
estimated abilities have different properties than true abilities.
In addition, the number of items on which an ability estimate is
based differs by sample; hence, estimation errors will be different
for each sample. However, a number of observations can be made.




There are four pretest samples that contribute over half of the
pretest items that appear in 3ASA3. They are designated C1613,
C1614, C2314, and C2318. The mean estimated ability for these
samples is about .2 to .4 standard deviations below the mean
estimated ability of the operational sample (3ASA3-Oper.). The
standard deviations of estimated ability vary by at most .05. It
is this kind of mean shift with no change in variance that the
first simulation was designed to study. It can be seen from the
results of the first simulation that mean differences alone cannot
account for the preequating results found with the real data.

Of greater interest in Figure 10 is the comparison of the
differences in the percentiles shown. Here one sees that the
distributions of estimated abilities are distorted, not merely
shifted, when compared to the distribution of estimated abilities
for the operational form. For the four pretest samples
contributing over half the items, the distributions are shifted
lower when compared to the operational form, but the shift is
larger at the 25th, 50th, and 75th percentiles than it is at the
10th or 90th percentiles.

These samples are supposed to be samples from the same overall
population, although we have no way of proving the truth of this
assertion. It is possible, of course, that repeated samplings from
the same population can give rise to such distortions. It is also
plausible that such distortions can result from some mechanism that
makes a unidimensional IRT model inappropriate for these data.

It is not hard to advance hypotheses about circumstances that
could introduce multidimensionality. Among the many possible ones
are the effects of improved teaching methods on more recent samples
of students, changes in emphasis and curriculum that took place
between pretest and operational administration, and the ability of
examinees to recognize and therefore have different motivation on
pretest sections. This latter situation could very well be
applicable for the real-data results.

In current SAT administrations, test sections that contain
items that are being pretested are labeled in a manner that is
indistinguishable from operational sections, and appear in
different locations in different test booklets. This has not
always been the case. Less than half the items in the final form
3ASA3 were contributed by 12 pretest sections that had labeling
that could be distinguished from that of operational sections;
these pretest sections have designations beginning with X or Z.
More than half the items in the final form 3ASA3 were contributed
by 4 pretest sections that had indistinguishable labeling, but were
always located in the same positions within the test booklets.
Thus there is some reason to believe that any of the pretest
sections contributing items to final form 3ASA3 could have been
subjected to recognition and, therefore, motivational effects.




In these studies, we focus our attention on the four pretests 
that were administered in 'fixed' rather than 'variable' positions 
for three reasons: 1) these pretests contributed over half of the 
items to final form 3ASA3, 2) these pretests were administered most 
recently and therefore within the current social climate of 'test 
wiseness' encouraged by coaching schools, and 3) for students not 
possessing special information, a pretest section in a fixed 
position is probably easier to detect than a pretest section having 
a label based on a distinguishable labeling scheme. 
The Multidimensional Model 

McDonald (1982) provides a broad framework, based on nonlinear 
factor analysis, for the classification of unidimensional and 
multidimensional models. The particular model chosen here falls into 
McDonald's general category of nonlinear multidimensional models. 

In the particular model used in these studies, examinee 
responses to some items are generated using 3PL item response 
functions and a certain true ability. Responses to other items for 
the same examinee are generated using 3PL item response functions 
and a second true ability. The second true ability is related to 
the first through a discontinuous step function. 
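
The response-generation scheme described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the program used in the study: the function names, the scaling constant D = 1.7, and the example item parameters are all hypothetical choices made for the sketch.

```python
import math
import random

D = 1.7  # logistic scaling constant; assumed here, not stated in the text


def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def simulate_responses(theta_first, theta_second, items_first, items_second, rng):
    """Generate 0/1 responses for one simulee.

    Items in the first block are answered with theta_first and items in
    the second block with theta_second. Each item is an (a, b, c) tuple.
    """
    responses = []
    for a, b, c in items_first:
        responses.append(1 if rng.random() < p_3pl(theta_first, a, b, c) else 0)
    for a, b, c in items_second:
        responses.append(1 if rng.random() < p_3pl(theta_second, a, b, c) else 0)
    return responses


# Hypothetical example: 24 anchor items answered at ability 0.0 and
# 30 pretest items answered at a lower second ability of -0.4.
rng = random.Random(1)
example_items = [(1.0, 0.0, 0.2)]
responses = simulate_responses(0.0, -0.4, example_items * 24, example_items * 30, rng)
```

The 3PL form holds for every item; the only departure from unidimensionality is that the ability passed in differs between the two blocks.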

This model in effect forces the 3PL model to hold for all item 
response functions but assumes that examinees respond to some items 
with one ability and to others with another. This is different from, 
and therefore less familiar than, the multidimensional linear model 
often used in IRT multidimensionality studies (see Drasgow & 
Parsons, 1983, for example) but seems more intuitively appealing 
in the present circumstances. 
Simulated Samples 

Using the true abilities, three new samples of N = 2500 each 
were drawn with no modifications to the true ability distribution. 
The 60 items in 3ASA3 were considered to be 'operational' form 
3ASA3-5. Two nonoverlapping random subsets of items from 3ASA3 
were formed, each containing 30 items and designated as 3ASA3-5A 
and 3ASA3-5B. These two smaller subsets are to be considered as 
pretest items for equating purposes: each will be administered to 
different samples, and the resulting item parameter estimates will 
be combined to constitute a full 60-item test form. Using true 
parameters, responses were generated for simulees as follows: 

     Sample 1 responded to the 24 items of equating section fn 
              and test form 3ASA3-5. 

     Sample 2 responded to the 24 items of equating section fn 
              and the 30 items in 3ASA3-5A. 

     Sample 3 responded to the 24 items of equating section fn 
              with abilities sampled from the same true ability 
              distribution as the other two samples. However, 
              when responding to the 30 items in 3ASA3-5B, their 
              true abilities were distorted. This was done in 
              the following manner: 




     1. If true ability was less than or equal to -1, no change in 
        ability was made. 

     2. If true ability was between -1 and -.5, the simulee 
        responded with an ability equal to true ability minus .2. 

     3. If true ability was between -.5 and +.5, the simulee 
        responded with an ability equal to true ability minus .4. 

     4. If true ability was between .5 and 1.5, the simulee 
        responded with an ability equal to true ability minus .6. 

     5. If true ability was above 1.5, no change in ability was made. 
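
The five rules above amount to a step function of true ability with decrements of .2, .4, and .6 in the three middle bands. A minimal sketch follows; the function name is hypothetical, and the handling of values falling exactly on a band boundary (assigned here to the lower band) is an assumption, since the text says only "between".

```python
def distort_ability(theta):
    """Return the second (distorted) ability for a given true ability.

    Boundary values are assigned to the lower band; this is an assumption,
    because the source does not say how exact boundaries were treated.
    """
    if theta <= -1.0:
        return theta          # rule 1: no change
    if theta <= -0.5:
        return theta - 0.2    # rule 2
    if theta <= 0.5:
        return theta - 0.4    # rule 3
    if theta <= 1.5:
        return theta - 0.6    # rule 4
    return theta              # rule 5: no change


# Hypothetical true abilities, one per band.
for theta in (-2.0, -0.75, 0.0, 1.0, 2.0):
    print(theta, distort_ability(theta))
```

Note that the function is discontinuous: a simulee just below 1.5 loses .6, while one just above loses nothing.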

These particular distortions were chosen to reflect 
distortions that might have caused the results observed for 
distributions of estimated ability from the real data. There are at 
least two intuitively appealing rationales that can be used to 
justify them. The first rationale runs along the following lines: 
individuals of low ability are not aware of clues that might change 
their motivation, so their behavior remains the same. As true 
ability increases, so does sensitivity to such clues and the ability 
to take advantage of them. Very able individuals, however, have no 
need to use such clues and continue to perform at the same high 
level as before. 

A second appealing rationale focuses on targeted improvements 
in teaching and curricula. Individuals of very low ability are not 
in the targeted group. As true ability increases, the improvements 
become more appropriate and have a larger impact. Very able 
individuals, however, have no need of improved teaching or 
curricula since their ability is so high that they will learn the 
appropriate material regardless of how poorly or well it is taught. 

The results of the generation of samples of true ability are 
shown in the bottom portion of Table 5. For the third sample, only 
the distorted abilities used to generate responses to 3ASA3-5B are 
shown. The true abilities used for responses to equating section 
fn would be similar to the distributions shown for the first two 
samples. As can be seen, the mean of the distorted abilities is 
about 1/3 of a standard deviation below the means of the other 
two samples, and the percentiles are offset in a manner similar to 
that found in the real-data estimated ability distributions, 
although somewhat more exaggerated. 

Insert Table 5 about here 

Estimation of Item Parameters and Abilities 

As before, LOGIST was used to calibrate all items and 
abilities in a single concurrent execution, with equating items fn 
used as an anchor to set the scale. The N = 7500 and n = 144 
data matrix can be represented as follows: 




                                 Items 

                  fn      3ASA3-5    3ASA3-5A    3ASA3-5B 
     People      n=24      n=60        n=30        n=30 

     Sample 1      X         X 

     Sample 2      X                     X 

     Sample 3      X                                 X 



The above design produces two different sets of item parameter 
estimates for the total 60-item test, one as part of 3ASA3-5, and 
the second as part of the combination of 3ASA3-5A and 3ASA3-5B. 
For Sample 3, where the true abilities differ for responses to fn 
items and 3ASA3-5B items, only one ability estimate is produced 
from the unidimensional IRT model. 
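
The calibration design can be written down as a mask showing which of the 144 item columns each sample actually answers. The sketch below is illustrative only; the variable and function names are hypothetical, and the block sizes and sample sizes are taken from the design described in the text.

```python
# Item blocks in column order, with their sizes (24 + 60 + 30 + 30 = 144).
BLOCKS = [("fn", 24), ("3ASA3-5", 60), ("3ASA3-5A", 30), ("3ASA3-5B", 30)]

# Blocks administered to each sample; all other columns are not reached by design.
DESIGN = {
    1: {"fn", "3ASA3-5"},
    2: {"fn", "3ASA3-5A"},
    3: {"fn", "3ASA3-5B"},
}

N_PER_SAMPLE = 2500


def administered_mask(sample):
    """Return a list of 144 booleans: True where the sample answers the item."""
    mask = []
    for name, n_items in BLOCKS:
        mask.extend([name in DESIGN[sample]] * n_items)
    return mask


total_items = sum(n_items for _, n_items in BLOCKS)
total_people = N_PER_SAMPLE * len(DESIGN)
items_seen = {sample: sum(administered_mask(sample)) for sample in DESIGN}
print(total_items, total_people, items_seen)
```

Each of the 60 test items occupies two columns, once in the intact form and once in a pretest subset, which is why the single run yields two sets of estimates for the same items. Only the fn columns are answered by all three samples.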
Scaling of Estimates 

As before, the results of this LOGIST calibration are not on 
the same scale as the true item and person parameters. The same 
type of scaling transformation as used in Simulation 1 was repeated 
here, using the estimated and true abilities for Sample 1. 
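
As a rough illustration of what such a rescaling involves: a linear change of the ability scale, theta* = A(theta) + B, carries item parameters along as b* = A(b) + B, a* = a/A, and c* = c. The sketch below assumes that A and B are chosen to match the mean and standard deviation of the estimated abilities to those of the true abilities; that choice, like the function names, is an assumption made for illustration, and the procedure actually used is the one described for Simulation 1.

```python
import statistics


def linear_scaling(estimated, true):
    """Slope A and intercept B matching mean and SD of estimated to true abilities.

    Mean/SD matching is an assumed method, used here only for illustration.
    """
    slope = statistics.pstdev(true) / statistics.pstdev(estimated)
    intercept = statistics.mean(true) - slope * statistics.mean(estimated)
    return slope, intercept


def rescale_item(a, b, c, slope, intercept):
    """Carry 3PL item parameters onto the new ability scale."""
    return a / slope, slope * b + intercept, c


# Hypothetical abilities: the true scale is twice as spread out and shifted by 1.
A, B = linear_scaling([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
new_a, new_b, new_c = rescale_item(1.0, 0.5, 0.2, A, B)
```

The lower asymptote c is unaffected because it is a probability, not a location or slope on the ability scale.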
Comparison of Estimated and True Parameters 

Summary statistics for the estimates of item and person 
parameters after the scaling transformation are presented in Table 
5. The percentile comparisons among distributions of estimated 
abilities are graphically displayed in Figure 1. Figure 11 
compares the estimated item parameters and abilities with true item 
parameters and abilities for test 'form' 3ASA3-5 and Sample 1. 
Figure 12 compares only the estimated item parameters with true 
item parameters for the total test 'form' 3ASA3-5A+5B, constructed 
by combining the items from 3ASA3-5A and 3ASA3-5B. Ability 
estimates are not compared in Figure 12 since there are two true 
abilities for Sample 3. 

Insert Figures 11 and 12 about here 

For the intact form 3ASA3-5, Figure 11 shows the a's to be 
slightly underestimated, although the mean estimated a is the 
same as the mean true a. The b's are very well estimated. 
The c's are about as well estimated as one typically sees, as are 
the abilities. For the 'pretest' form 3ASA3-5A+5B, shown in 
Figure 12, the a's are slightly overestimated. The b's are 
greatly overestimated, and the c's are slightly overestimated. 

The explanation for the phenomena exhibited by the 'pretest' 
form is relatively simple. The individuals in Sample 2 respond to 
both fn items and pretest items with the same true ability. 
However, most of the individuals in Sample 3 respond to fn items 
with one true ability, and to pretest items with a lower true 
ability. The number of items is roughly the same in both instances 
(24 for fn and 30 for the pretest). Thus LOGIST will, as much as 
possible, produce an estimated ability for simulees in Sample 3 
that is somewhere in between the two true abilities. This estimate 
will be higher than the true ability with which responses were 
generated to the pretest items. A person will get a pretest item 
incorrect more frequently than is expected on the basis of this 
ability estimate. Therefore, the estimation procedure behaves as 
if the pretest item is more difficult than it really is. The 
unidimensional estimation procedure is given incorrect information 
from the data as to the item location. 
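
The direction of this bias can be checked with a small calculation. The sketch below is a deliberate simplification, not the LOGIST estimation procedure: it considers a single simulee and a single item, holds a and c fixed at their true values, and asks which difficulty would reproduce the actual success probability if the simulee's ability were taken to be the in-between estimate. The function names, the constant D = 1.7, and the numerical values are assumptions.

```python
import math

D = 1.7  # logistic scaling constant; assumed


def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def implied_difficulty(theta_hat, theta_pretest, a, b, c):
    """Difficulty that makes P(theta_hat) equal the true P(theta_pretest).

    Solved by bisection; the 3PL probability decreases as b increases.
    """
    target = p_3pl(theta_pretest, a, b, c)
    low, high = -10.0, 10.0
    for _ in range(100):
        mid = (low + high) / 2.0
        if p_3pl(theta_hat, a, mid, c) > target:
            low = mid   # item still looks too easy at this b; move b up
        else:
            high = mid
    return (low + high) / 2.0


# Hypothetical case: anchor ability 0.0, pretest ability -0.4,
# and an ability estimate of -0.2 lying between the two.
b_implied = implied_difficulty(-0.2, -0.4, 1.0, 0.0, 0.15)
print(round(b_implied, 3))
```

With a and c fixed, the 3PL probability depends on ability and difficulty only through their difference, so the implied difficulty exceeds the true one by exactly the amount by which the ability estimate exceeds the pretest ability.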

Wingersky and Lord (1984) show that for middle difficulty 
items, the sampling correlation between estimated a's and 
estimated b's is positive. If the b's are overestimated, the 
a's will also be overestimated. Wingersky and Lord also show 
that, for middle difficulty items, the sampling correlation between 
estimated a and estimated c is positive. Thus, if the a's are 
overestimated, then so are the c's on average. 

Summary statistics are presented for the estimated abilities 
in Table 5 and Figure 1. It is interesting to note that the 
estimation procedure produces estimated abilities for Sample 3 that 
are not much different from those estimated for Samples 1 and 2. 
The difference in true ability means disappears. Although there 
are still differences in each percentile point recorded, these 
differences are smaller than those modeled with the true abilities. 
Part of this is due to the production of ability estimates for 
Sample 3 that lie between the two true abilities, but the extent of 
differences was a surprise. Because the model does not incorporate 
two ability dimensions, the differential item responses are reflected 
mostly in the estimated item difficulties, and not in the estimated 
abilities. As a result, these estimated abilities do not have a 
relationship across samples that is very similar to that seen for the 
estimated abilities for real data shown in Table 4 and Figure 10. 
Equating Results 

The impact of this type of simulated multidimensionality on 
equating is seen in the top two plots in Figure 13, where equating 
and residual plots resulting from the equating of 'pretest' form 
3ASA3-5A+5B to operational form 3ASA3-5 are depicted. As expected, 
this type of lack of model fit has a large impact on equating. At 
some points on the raw score metric, the difference between scaled 
scores is over 30 scaled score points. Table 3 shows that there is a 
difference of about 25 points in the scaled score means, as well as 
about a 7 point difference in the scaled score standard deviations. 

Insert Figure 13 about here 

These differences are much larger than the largest differences 
among scaled score means and standard deviations found with real 
data. There, the maximum difference between a preequating scaled 
score mean and a criterion mean was +13 scaled score points. The 
associated difference between scaled score standard deviations was 
+3.0 scaled score points. However, in contrast with the earlier 
simulation study, the mean difference found here is IN THE SAME 
DIRECTION AS THAT FOUND IN REAL DATA. 

The equating differences are explainable in terms of the item 
parameter misestimations previously described. Figure 14 compares 
the estimates of item parameters from the pretest form against the 
operational form. Different plotting symbols are used to indicate 
whether an item comes from pretest 3ASA3-5A or 3ASA3-5B. As 
expected, parameter estimates from 3ASA3-5B, with the simulated 
multidimensionality, cause the a's and c's to be slightly higher 
for the pretest form. Table 5 shows that the pretest mean a is 
1.01 compared to the final form mean a of .98; the pretest mean 
c is .15 compared to the final form mean c of .13. The item 
difficulties are substantially overestimated; the pretest mean b is 
.26 while the final form mean b is .04. 

Insert Figure 14 about here 

The resulting impact on equating is most easily seen in Figure 
15, which contains plots of the test characteristic curves for the 
two simulated forms. Because of the overestimation of the item 
difficulties, the test characteristic curve for the pretest form is 
shifted to the right of the final form. For true ability levels 
above -1.0, the number right true score on this form will be lower 
than on the final form. The pretest form appears more difficult 
for these examinees. Note, however, that for examinees with very 
low true ability, the pretest form actually appears easier. 
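
A test characteristic curve is the sum of the item response functions, giving the expected number-right true score at each ability. The sketch below shows how higher estimated difficulties together with slightly higher lower asymptotes can produce the crossing described above. The two three-item forms are hypothetical and built only to mimic the direction of the mean differences reported for a, b, and c; they are not the study's items, and D = 1.7 is an assumption.

```python
import math

D = 1.7  # logistic scaling constant; assumed


def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def tcc(theta, items):
    """Expected number-right true score at ability theta."""
    return sum(p_3pl(theta, a, b, c) for a, b, c in items)


# Hypothetical three-item forms, listed as (a, b, c).
final_form = [(0.98, -1.0, 0.13), (0.98, 0.0, 0.13), (0.98, 1.0, 0.13)]
# Same items with b raised by .22 and slightly higher a and c.
pretest_form = [(1.01, b + 0.22, 0.15) for _, b, _ in final_form]

for theta in (-4.0, 0.0, 2.0):
    print(theta, round(tcc(theta, final_form), 3), round(tcc(theta, pretest_form), 3))
```

At middle and high abilities the shift in difficulty dominates and the pretest curve lies below the final-form curve; at very low abilities both curves approach the sum of the c's, so the form with the larger c's appears easier.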

Insert Figure 15 about here 

Equating of Estimates to True Values 

It is again instructive to examine the equatings of each 
simulated test form to the true test form. In this way, we can 
isolate and study estimation errors separately for each form. 

The bottom two sets of plots in Figure 13 show equating 
results when 3ASA3-5 and 3ASA3-5A+5B are equated to the true test 
form 3ASA3. The resultant mean scaled scores and standard 
deviations are shown in Table 3. The differences seen for form 
3ASA3-5 again indicate the magnitude of equating differences that 
can be explained on the basis of what is predominantly the 
imprecision of the estimation procedure alone, since the 
distribution of true ability for the sample taking this form was 
the same as the true distribution of ability. It is reassuring to 
note, through a comparison with plots in Figure 9, that the 
multidimensionality simulated for items not contained in this 
60-item set has a negligible impact on equating errors for this form. 
The bottom plots in Figure 13, depicting the equating of the 
simulated pretest form to the true form, demonstrate the impact on 
equating when the data do not fit the model. 




Conclusions from the Second Simulation 

The multidimensionality modeled in this simulation was 
designed to reflect certain intuitively justifiable hypotheses. It 
is clear that when compared to results with real data, the model is 
greatly exaggerated. It is also clear, from the resulting 
distributions of estimated abilities, that while this model may be 
a step closer than the first simulation to explaining real data 
results, it is by no means complete. 

Third Simulation: Mean Differences and Multidimensionality 
The advantage of simulation studies is that they can be used 
not only to isolate phenomena of interest, but also to study 
controlled combinations of such phenomena. The results of the 
second simulation study were dissimilar to real-data 
results in an important way: the relationship among the 
distributions of estimated abilities did not resemble very closely 
the relationships found with real data. This third simulation 
attempts to model the real-data results more faithfully, by 
combining the phenomena studied in the first two simulations. 
Simulated Samples 

Using the true abilities, three more samples of N = 2500 each 
were drawn. The first two samples were drawn with no modification 
to the true ability distribution. The third sample was drawn 
after decreasing the mean true ability by .35, as in the smallest 
mean decrease in the first simulation study. 




The 60 items in 3ASA3 were considered to be intact form 
3ASA3-6, and were taken, along with equating section fn, by the 
first sample. The same random 30-item subset as 3ASA3-5A is 
considered here to be 3ASA3-6A, and was taken, along with equating 
section fn, by the second sample. The remaining random subset of 
30 items, 3ASA3-5B, is considered here to be 3ASA3-6B, and was 
taken, along with equating section fn, by the third sample. When 
the third sample responds to the 24 items in the equating section, 
it does so with average true ability decreased by .35. When the 
third sample responds to the 30-item 3ASA3-6B, the average true 
ability is decreased by .35 and then THE SAME distortion in true 
abilities as described earlier is repeated. Note that this 
distortion is applied to the distribution of true abilities AFTER 
the mean true ability has been decreased. 

The results of the generation of samples of true ability are 
shown in the bottom portion of Table 6. For the third sample, only 
the distorted abilities used to generate responses to 3ASA3-6B are 
shown. The mean-shifted true abilities used to generate responses 
to equating section fn would be similar to those of the other two 
samples except for the shift, and also to sample 2 from the 
first simulation study (see Table 2). The mean of the distorted 
true abilities is now about 2/3 of a standard deviation below the 
other true sample means, as opposed to the 1/3 obtained from 
distortion alone (see Table 5). 
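
The 1/3 and 2/3 figures can be checked approximately by computing the expected drop in mean ability. The sketch below assumes the undistorted abilities are normally distributed with unit standard deviation, which only approximates the true ability distribution used in the study, and applies the band decrements of .2, .4, and .6 described earlier after any mean shift. The names are hypothetical.

```python
import math

# (lower bound, upper bound, decrement) for the three distorted bands.
BANDS = [(-1.0, -0.5, 0.2), (-0.5, 0.5, 0.4), (0.5, 1.5, 0.6)]


def normal_cdf(x, mean=0.0, sd=1.0):
    """Cumulative distribution function of a normal variable."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def expected_mean_drop(mean_shift):
    """Expected total decrease in mean ability, in standard deviation units.

    Abilities are assumed N(-mean_shift, 1) before the distortion is applied.
    """
    shifted_mean = -mean_shift
    distortion = sum(
        decrement * (normal_cdf(high, shifted_mean) - normal_cdf(low, shifted_mean))
        for low, high, decrement in BANDS
    )
    return mean_shift + distortion


print(round(expected_mean_drop(0.0), 2))   # distortion alone
print(round(expected_mean_drop(0.35), 2))  # mean shift of .35, then distortion
```

Under these assumptions the drop is roughly .33 for distortion alone and roughly .63 when the .35 shift is applied first. The distortion itself contributes a little less after the shift because fewer simulees fall in the bands with the largest decrements.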






Insert Table 6 about here 



Estimation of Item Parameters and Abilities 



The LOGIST calibration of items and abilities is a single 
concurrent run, as in the previous simulations. The design of this 
N = 7500 and n = 144 calibration is the same as that for the 
second simulation. For completeness, it is repeated here. 



                                 Items 

                  fn      3ASA3-6    3ASA3-6A    3ASA3-6B 
     People      n=24      n=60        n=30        n=30 

     Sample 1      X         X 

     Sample 2      X                     X 

     Sample 3      X                                 X 



As before, two different sets of 60-item total test parameter 
estimates are obtained, one as part of 3ASA3-6, and the second as 
part of the combination of 3ASA3-6A and 3ASA3-6B. Only one 
estimate of ability is obtained for individuals in Sample 3. 
Scaling of Estimates 

As before, the results of this calibration must be transformed 
to the scale of the true item parameters before any comparisons can 
be made. The same type of transformation as before was performed, 
using the true and estimated abilities for Sample 1. 




Comparison of Estimates and True Parameters 

Summary statistics for the estimation of item and person 
parameters after the scaling transformation are presented in Table 
6 and Figure 1. As before, the distortion of true abilities has 
caused the average estimated b on the 'pretest' to be larger than 
that of the intact form, .18 vs. 0.0. However, the difference in 
the averages is less than that seen in Table 5 for distortion 
alone, where the means were .26 and .04. The mean estimated a's 
are identical as are the mean estimated c's. 

Figures 16 and 17 graphically compare the estimated and true 
item parameters for the two 60-item forms, 3ASA3-6 and 3ASA3-6A+6B. 
A comparison of these figures with Figures 11 and 12 shows that the 
a's for both forms are better estimated here, as are the 
c's. As expected, the estimated item difficulties for the intact 
form appear as well estimated as before; those for the pretest form 
are less overestimated. 

Insert Figures 16 and 17 about here 

The summary statistics for the estimated abilities in Table 6 
and Figure 1 show that the mean estimated ability for sample 3 is 
about 1/3 of a standard deviation below that of the other two 
samples, in contrast to the near equality seen in Table 5. In 
addition, the percentiles reflect the distortion in true abilities 
to a much greater degree. The differences among the distributions 
of estimated ability in Table 6, with mean shift and 
multidimensionality introduced, in contrast to Table 5, with 
multidimensionality alone, provide some intuition as to the 
behavior of the item parameter estimates. 

The third sample in each simulation takes two blocks of items, 
equating section fn in common with the other two samples, and the 
second block of pretest items. The only information for estimating 
the item parameters for this second set of pretest items comes from 
the third sample in each case. With multidimensionality introduced 
alone, the responses to items in equating section fn by the third 
sample are equivalent to those from the other two samples, since 
all three are samples from the same distribution of true ability. 
Therefore, the multidimensionality introduced into the responses 
for the second block of pretest items is reflected in the item 
parameter estimates for those items alone, and is not attributed 
to differences in ability. In contrast, when a mean shift in 
ability and multidimensionality are introduced, the third sample 
responds to equating section fn with a lower mean ability. 
Therefore the lack of success on the second block of pretest items 
introduced by the multidimensionality can be attributed in part to 
the lower mean true ability. Therefore, the item difficulties are 
less overestimated. 

Although somewhat more exaggerated, the relationships between 
item parameter estimates for the pretest and intact forms shown in 
Table 6 and Figure 1 replicate those found with real data. In 
addition, the relationships among the distributions of estimated 
abilities shown in Table 6 and Figure 1 are similar to those found 
in Table 4 and Figure 10 for the four real-data pretest samples 
contributing the largest number of items to final form 3ASA3. 
Equating Results 

The impact on equating of a mean shift in ability and the 
introduction of multidimensionality is shown in equating and 
residual plots at the top of Figure 18. These plots depict the 
equating of pretest form 3ASA3-6A+6B to the intact form 3ASA3-6. As 
expected, the impact on equating is large, although not as large as 
that for multidimensionality alone. Table 3 shows that the impact 
has been reduced to about 22 scaled score points at the mean, in 
contrast to 25 scaled score points for multidimensionality alone. 

Insert Figure 18 about here 

As before, the equating differences are explainable in terms 
of item parameter misestimations previously described. Figure 
19 graphically compares the two sets of estimates, with different 
plotting symbols indicating the membership of an item in 3ASA3-6A 
or 3ASA3-6B. The previous remarks made in reference to the plots 
in Figure 14 are applicable here. 

Insert Figure 19 about here 




The test characteristic curves for the two forms are shown 
in Figure 20. While similar to Figure 15, the curves are closer to 
each other, particularly for low ability levels. 

Insert Figure 20 about here 

Equating of Estimates to True Values 

The bottom two sets of plots in Figure 18 show equating 
results when 3ASA3-6 and 3ASA3-6A+6B are equated to the true test 
form 3ASA3. The resultant mean scaled scores and standard 
deviations are shown in Table 3. The differences seen for the 
intact form again reflect the magnitude of equating differences 
that can be explained on the basis of what is predominantly the 
imprecision of the estimation procedure alone. It is again 
reassuring that this equating is not contaminated by the 
introduction of a mean shift and multidimensionality in the third 
sample. The bottom set of plots in Figure 18 demonstrates the 
impact on equating when both phenomena are introduced. 
Conclusions from the Third Simulation 

The introduction of a slight decrease in the mean of the true 
abilities in conjunction with a certain type of multidimensionality 
produces results that are consistent with those seen in real data. 
However, it is clear that the effects are exaggerated when compared 
to the real data results. Presumably this is because the 
multidimensionality was modeled for every individual with a 
particular true ability in exactly the same way. A more 
realistic model would introduce this type of multidimensionality 
for only a certain proportion of individuals with the same true 
ability. It is likely that this modification would produce results 
that resemble real-data results even more closely. 

CONCLUSIONS 

The purpose of this study was to understand the impact of 
differences in true ability in a particular application that 
depends upon item parameter invariance: preequating. Starting 
from reasonable hypotheses suggested by the real SAT-mathematical 
preequating data, three controlled simulations were conducted to 
test these hypotheses. The results clearly have implications 
beyond an understanding of a particular set of real data. They can 
be stated generally as follows: 

1. Differences in mean true ability can cause differences in 
the precision with which a particular estimation 
procedure estimates parameters, even when the data fit 
the particular IRT model used. The effect of this 
differential precision on preequating a test is 
relatively moderate. The particular differences in 
ability studied here produced an effect on preequating 
opposite to the one expected from the real-data 
preequating, although this could have been 
predicted in advance. 




2. The introduction of a particular kind of 
multidimensionality in the data can have a large impact 
on estimation precision when the IRT model is 
unidimensional. The computer program used here, LOGIST, 
reflects the impact of this type of multidimensionality 
mostly in the item parameter estimates, rather than the 
ability estimates. 

3. The combination of a slight decrease in mean ability and 
a particular type of multidimensionality in the data 
also has a large impact on estimation precision when the 
IRT model is unidimensional, although the impact is 
lessened somewhat. This occurs because the lack of 
model fit is incorporated into the estimated abilities 
as well as the item parameter estimates. 

In keeping with the desire to understand the particular set 
of SAT-mathematical data that generated the need for these 
simulation studies, one conclusion can be stated more specifically: 

Based on the reasonable simulations studied here, poor 
preequatings obtained for the particular set of SAT 
mathematical data were consistent with a combination of a 
slight decrease in mean true ability, and a particular type of 
multidimensionality introduced into specific pretest sections. 
Regardless of the causes to which one wants to attribute 
this multidimensionality, this conclusion appears 
inescapable. Given sufficient time and money, it is likely 
that further simulations could be devised that are even more 
consistent with real-data results than those presented here. 

What are the implications of these conclusions for future 
efforts to capitalize on the invariance properties of true item 
and person parameters? There are at least three: 

1. The unidimensional IRT model parameter estimates produced 
by LOGIST are relatively immune to imprecisions due to 
small differences in true ability. Differences as large 
as a standard deviation begin to have a greater impact, 
and the importance of that impact will clearly depend 
upon the particular application for which invariance is 
desired. Vertical equating applications, where 
differences may be as large or larger than those studied 
here, should be approached with caution. 

2. If data do not fit the unidimensional model in the 
particular manner modeled here, LOGIST provides some 
indication of this through the production of 
inconsistent results, e.g., the 'failure' of the 
preequating with real data. 

3. Greater efforts must be made both to ensure the data 
fitted with a unidimensional model are in fact 
unidimensional, and to develop practical, useful, and 
informative multidimensional models for the future. 





REFERENCES 

Drasgow, F., & Parsons, C. K. (1983). Application of 
     unidimensional item response theory models to multidimensional 
     data. Applied Psychological Measurement, 7, 189-199. 

Eignor, D. R. (1985). An investigation of the feasibility and 
     practical outcomes of pre-equating the SAT-verbal and 
     mathematical sections (Research Report 85-10). Princeton, 
     NJ: Educational Testing Service. 

Eignor, D. R., & Stocking, M. L. (1986). An investigation of 
     possible causes for the inadequacy of IRT pre-equating 
     (Research Report 86-14). Princeton, NJ: Educational 
     Testing Service. 

Kingston, N. M., & Dorans, N. J. (1985). The analysis of item- 
     ability regressions: An exploratory IRT model fit tool. 
     Applied Psychological Measurement, 9, 281-288. 

Lord, F. M. (1975). Evaluation with artificial data of a procedure 
     for estimating ability and item characteristic curve 
     parameters (Research Bulletin 75-33). Princeton, NJ: 
     Educational Testing Service. 

Lord, F. M. (1980). Applications of item response theory to 
     practical testing problems. Hillsdale, NJ: Erlbaum. 




McDonald, R. P. (1982). Linear versus nonlinear models in item 
     response theory. Applied Psychological Measurement, 6, 379- 
     396. 

Petersen, N. S., Cook, L. L., & Stocking, M. L. (1983). IRT versus 
     conventional equating methods: A comparative study of scale 
     stability. Journal of Educational Statistics, 8, 136-156. 

Wingersky, M. S. (1985). Personal communication. 

Wingersky, M. S., & Lord, F. M. (1984). An investigation of 
     methods for reducing sampling error on certain IRT procedures. 
     Applied Psychological Measurement, 8, 347-364. 

Wingersky, M. S., Barton, M. A., & Lord, F. M. (1982). LOGIST V 
     users guide. Princeton, NJ: Educational Testing Service. 




Table 1 

Summary Statistics for True Item and Person Parameters, 
Test Form 3ASA3 and Equating Section fn 

                         True Item Parameters 

                         3ASA3           fn 

     Max a                1.71       [illegible] 
     Mean a                .70       [illegible] 
     Median a          [illegible]   [illegible] 
     Min a                 .30           .43 
     S.D. (a)              .33           .25 
     n                      60            24 

     Max b                2.33          2.44 
     Mean b               -.01           .17 
     Median b              .05           .25 
     Min b               -3.32         -3.25 
     S.D. (b)             1.27          1.27 
     n                      60            24 

     Max c                 .41           .29 
     Mean c                .14           .14 
     Median c              .13           .12 
     Min c                   0             0 
     S.D. (c)              .10           .08 
     n                      60            24 

                            True Abilities 

                                                    Percentiles 
        N    Mean     SD     Min    Max      10     25     50     75     90 

     2744    -.01   1.03   -7.35   3.91   -1.36   -.66    .01    .71   1.24 




Table 2 

Summary Statistics for First Simulation Study: Mean Shift Only 
All Parameter Estimates Have Been Transformed to the Scale of the True Values 

                          Item Parameter Estimates 

     Test Form     3ASA3-1   3ASA3-2   3ASA3-3   3ASA3-4       fn 
     Sample              1         2         3         4 

     max a            1.7?      1.7?      1.7?      1.70   [illegible] 
     mean a            .95      1.01       .99      1.05      .96 
     median a          .91       .97       .92      1.01      .96 
     min a             .33       .30       .37       .36      .47 
     S.D. (a)          .32       .33       .34       .33      .27 
     n                  60        60        60        60       24 

     max b            2.67      3.02      2.91      2.26     2.62 
     mean b            .01       .05       .03      -.01      .19 
     median b         -.04       .07      -.04       .03      .25 
     min b           -2.96     -2.95     -3.38     -3.53    -3.0? 
     S.D. (b)         1.28      1.24      1.28      1.21     1.24 
     n                  60        60        60        60       24 

     max c             .40       .40       .37       .36      .28 
     mean c            .12       .13       .12       .13      .12 
     median c          .09       .12       .11       .12      .11 
     min c               0         0         0         0      .01 
     S.D. (c)          .10       .10       .09       .09      .08 
     n                  60        60        60        60       24 

                             Ability Estimates 

              Form                                                  Percentiles 
     Sample   Taken (n)        N    Mean    SD     Min    Max     10     25     50     75     90 

       1      3ASA3-1(60)   2498   -0.01  1.04   -7.92   4.27  -1.27  -0.65  -0.02   0.69   1.31 
       2      3ASA3-2(60)   2498   -0.31  1.01   -7.92   4.10  -1.53  -0.92  -0.28   0.36   0.90 
       3      3ASA3-3(60)   2500   -0.67  1.05   -7.92   3.49  -1.83  -1.25  -0.67  -0.02   0.59 
       4      3ASA3-4(60)   2500   -0.98  0.99   -7.92   2.60  -2.15  -1.58  -0.96  -0.34   0.20 

                               True Abilities 

              Form                                                  Percentiles 
     Sample   Taken (n)        N    Mean    SD     Min    Max     10     25     50     75     90 

       1      3ASA3-1(60)   2500   -0.02  1.02   -7.35   3.85  -1.34  -0.67  -0.03   0.71   1.26 
       2      3ASA3-2(60)   2500   -0.35  1.00   -3.78   3.56  -1.67  -0.97  -0.32   0.35   0.86 
       3      3ASA3-3(60)   2500   -0.72  1.02   -8.05   3.15  -2.04  -1.37  -0.73   0.01   0.56 
       4      3ASA3-4(60)   2500   -1.07  1.01   -5.20   2.80  -2.36  -1.75  -1.08  -0.35   0.18 



ERIC 



Table 3

Scaled Score Means and Standard Deviations Resulting
from Equatings with Simulated Data

                                               Scaled Score   Scaled Score Mean
                               Scaled Score    Standard       Minus True Scaled
                               Mean            Deviation      Score Mean

3ASA3 (true)                   485             113

3ASA3-2 → 3ASA3-1              486             112              1
3ASA3-3 → 3ASA3-1              486             112              1
3ASA3-4 → 3ASA3-1              480             [missing]      [missing]

3ASA3-1 → 3ASA3 (true)         487             111              2*
3ASA3-2 → 3ASA3 (true)         488             110              3
3ASA3-3 → 3ASA3 (true)         488             110              3
3ASA3-4 → 3ASA3 (true)         482             106             -3

3ASA3-5A+5B → 3ASA3-5          510             120             25

3ASA3-5 → 3ASA3 (true)         487             112              2*
3ASA3-5A+5B → 3ASA3 (true)     512             119             27

3ASA3-6A+6B → 3ASA3-6          507             120             22

3ASA3-6 → 3ASA3 (true)         487             111              2*
3ASA3-6A+6B → 3ASA3 (true)     509             119             24

*Difference is due predominantly to errors of estimation.



Table 4

Summary Statistics for Estimated Abilities from Real Preequating Data
for Test Form 3ASA3, Sorted by Median Estimated Ability

Estimated Abilities*

                                                         Percentiles
Sample**         N    Mean   S.D.     Min    Max     10     25     50     75     90
X316(2)       2704     .19   1.01   -7.33   3.56   -.95   -.35    .24    .82   1.32
X313(1)       2795     .17   1.03   -7.33   3.49   -.91   -.38    .23    .76   1.30
3ASA3-Oper.   2772     .16    .97   -4.19   3.18  -1.14   -.43    .22    .82   1.33
X233(3)       2490     .14   1.04   -7.33   3.89  -1.21   -.54    .19    .84   1.43
X226(1)       2561     .13   1.03   -7.33   3.86  -1.15   -.54    .17    .82   1.39
X232(2)       2522     .15   1.04   -7.33   3.91  -1.13   -.51    .17    .83   1.45
X241(2)       2493     .14   1.00   -4.73   3.50  -1.11   -.51    .16    .84   1.36
X243(4)       2489     .14   1.03   -7.33   3.64  -1.13   -.49    .16    .78   1.39
X405(1)       2514     .06   1.21   -7.33   3.20  -1.31   -.60    .14    .82   1.41
X234(3)       2458     .10   1.04   -7.33   3.65  -1.15   -.54    .12    .76   1.41
X415(1)       2828    -.14   1.25   -7.33   3.25  -1.57   -.77   -.02    .66   1.25
Z515(1)       2513    -.17   1.34   -7.33   3.45  -1.55   -.79   -.07    .59   1.26
C2318(9)      2727    -.06    .98   -7.33   3.05  -1.22   -.67   -.10    .55   1.21
C2314(10)     2619    -.10    .98   -3.99   3.18  -1.35   -.76   -.11    .55   1.13
Z512(3)       2616    -.22   1.23   -7.33   3.51  -1.54   -.84   -.17    .51   1.15
C1613(10)     2963    -.26   1.00   -7.33   4.19  -1.44   -.88   -.26    .41    .95
C1614(7)      2883    -.27   1.02   -7.33   3.09  -1.48   -.92   -.27    .39    .99

*These ability estimates are on a different scale than those contained in all
other tables.

**The numbers in parentheses are the number of items contributed by this sample
to the total of 60 items in test form 3ASA3.



Table 5

Summary Statistics for Second Simulation Study: Distortion Only
All Parameter Estimates Have Been Transformed to the Scale of the True Values

Item Parameter Estimates

Test Form:       3ASA3-5  3ASA3-5A+5B    3ASA3-5A    3ASA3-5B           fn
Sample:                1      2 and 3           2           3  all samples

max a               1.75         1.74        1.74        1.74         1.38
mean a               .98         1.01        1.05         .97          .94
median a             .94          .90         .95         .80          .92
min a                .35          .32         .37         .32          .46
S.D. (a)             .32          .37         .38         .35          .26
n                     60           60          30          30           24

max b               2.64         3.07        2.75        3.07         2.64
mean b               .04          .26        -.07         .59          .19
median b            -.02          .35         .14         .63          .24
min b              -3.36        -3.49       -3.49       -2.80        -3.15
S.D. (b)            1.27         1.38        1.39        1.32         1.28
n                     60           60          30          30           24

max c                .47          .38         .38         .38          .27
mean c               .13          .15         .14         .15          .12
median c             .12          .14         .13         .14          .10
min c                  0            0           0           0            0
S.D. (c)             .11          .09         .09         .10          .08
n                     60           60          30          30           24

Ability Estimates

                                                                 Percentiles
Sample  Form Taken (n)     N    Mean     SD     Min    Max     10     25     50     75     90
1       3ASA3-5(60)     2496    0.03   1.05   -7.29   4.11  -1.25  -0.65   0.01   0.71   1.35
2       3ASA3-5A(30)    2496    0.03   1.07   -7.29   3.80  -1.29  -0.66   0.03   0.70   1.34
3       3ASA3-5B(30)    2499    0.04   1.10   -7.29   4.41  -1.19  -0.59   0.02   0.66   1.30

True Abilities

                                                                 Percentiles
Sample  Form Taken (n)     N    Mean     SD     Min    Max     10     25     50     75     90
1       3ASA3-5(60)     2500    0.01   1.02   -4.15   3.85  -1.32  -0.65   0.01   0.71   1.30
2       3ASA3-5A(30)    2500    0.01   1.02   -4.15   3.85  -1.32  -0.65   0.01   0.71   1.30
3       3ASA3-5B(30)    2500   -0.33   0.92   -4.15   3.85  -1.32  -0.88  -0.39   0.11   0.70



Table 6

Summary Statistics for Third Simulation Study: Mean Shift and Distortion
All Parameter Estimates Have Been Transformed to the Scale of the True Values

Item Parameter Estimates

Test Form:       3ASA3-6  3ASA3-6A+6B    3ASA3-6A    3ASA3-6B           fn
Sample:                1      2 and 3           2           3  all samples

max a               1.73         1.78        1.78        1.78         1.37
mean a              1.00         1.00        1.02         .98          .95
median a             .96          .93         .95         .84          .94
min a                .30          .25         .35         .25          .42
S.D. (a)             .34          .40         .38         .42          .25
n                     60           60          30          30           24

max b               2.47         2.63        2.38        2.63         2.40
mean b                 0          .18        -.13         .49          .17
median b            -.08          .29        -.20         .70          .22
min b              -3.37        -3.84       -3.84       -2.70        -3.38
S.D. (b)            1.28         1.37        1.40        1.27         1.30
n                     60           60          30          30           24

max c                .46          .44         .42         .44          .27
mean c               .13          .13         .13         .13          .12
median c             .13          .11         .11         .12          .10
min c                  0            0           0           0            0
S.D. (c)             .10          .10         .09         .10          .07
n                     60           60          30          30           24

Ability Estimates

                                                                 Percentiles
Sample  Form Taken (n)     N    Mean     SD     Min    Max     10     25     50     75     90
1       3ASA3-6(60)     2499    0.04   1.03   -7.18   4.05  -1.21  -0.61   0.01   0.70   1.34
2       3ASA3-6A(30)    2499    0.02   1.11   -7.18   3.85  -1.26  -0.66  -0.01   0.70   1.37
3       3ASA3-6B(30)    2498   -0.28   1.07   -7.18   3.83  -1.46  -0.81  -0.29   0.34   0.89

True Abilities

                                                                 Percentiles
Sample  Form Taken (n)     N    Mean     SD     Min    Max     10     25     50     75     90
1       3ASA3-6(60)     2500    0.02   1.02   -7.35   3.91  -1.29  -0.62   0.01   0.74   1.25
2       3ASA3-6A(30)    2500    0.02   1.02   -7.35   3.91  -1.29  -0.62   0.01   0.74   1.25
3       3ASA3-6B(30)    2500   -0.62   0.88   -7.70   3.56  -1.64  -1.07  -0.72  -0.07   0.30



[Figure 1 plot: "SAT Mathematical Simulated Data: Frequency Distn's of Estimated Abilities." One schematic per sample (3ASA3-1, 3ASA3-2, 3ASA3-3, 3ASA3-4, 3ASA3-5, 3ASA3-5A, 3ASA3-5B, 3ASA3-6, 3ASA3-6A, 3ASA3-6B) on an ability axis running from -2.5 to 1.5.]

Figure 1: Schematics of summary statistics of distributions of estimated
abilities for all simulated samples.

First simulation study: 3ASA3-1, 3ASA3-2, 3ASA3-3, 3ASA3-4.
Second simulation study: 3ASA3-5, 3ASA3-5A, 3ASA3-5B.
Third simulation study: 3ASA3-6, 3ASA3-6A, 3ASA3-6B.



[Figure 2 plot. Surviving panel labels: lower asymptote (c); abilities.]

Figure 2: First simulation study: Parameter estimates for Sample 1 taking
form 3ASA3-1 vs. true parameter values.



[Figure 3 plot. Panels: item discrimination (a); item difficulty (b); lower asymptote (c); ability. Horizontal axes: true values.]

Figure 3: First simulation study: Parameter estimates for Sample 2 taking
form 3ASA3-2 vs. true parameter values.



[Figure 4 plot. Panels: item discrimination (a); item difficulty (b); lower asymptote (c); ability.]

Figure 4: First simulation study: Parameter estimates for Sample 3 taking
form 3ASA3-3 vs. true parameter values.



[Figure 5 plot. Panels: item discrimination (a); item difficulty (b); lower asymptote (c); ability.]

Figure 5: First simulation study: Parameter estimates for Sample 4 taking
form 3ASA3-4 vs. true parameter values.



[Figure 6 plots. Equating plot (left) and residual plot (right) for each of three panels: 3ASA3-2 equated to 3ASA3-1; 3ASA3-3 equated to 3ASA3-1; 3ASA3-4 equated to 3ASA3-1. Horizontal axes: CRIT RS and CRIT SS. Legend: SS difference, SS line, criterion line (linear).]

Figure 6. Equating and equating residuals for the first simulation study.
Forms 3ASA3-2, 3ASA3-3, and 3ASA3-4 are equated to Form 3ASA3-1.



[Figure 7 plot. Surviving panel labels: item difficulty (b); lower asymptote (c). Horizontal axes: 3ASA3-1 estimated.]

Figure 7. First simulation study: A comparison of 3ASA3-4 estimated
parameters vs. 3ASA3-1 estimated parameters.



[Figure 9 plots. Equating plot (left) and residual plot (right) for panels including: 3ASA3-1 equated to 3ASA3 true; 3ASA3-2 equated to 3ASA3 true; 3ASA3-3 equated to 3ASA3 true. Horizontal axes: CRIT RS and CRIT SS. Legend: SS difference, SS line, criterion line (linear).]

Figure 9. Equating and equating residuals for first simulation study.
Forms 3ASA3-1, 3ASA3-2, 3ASA3-3, and 3ASA3-4 are equated to
true form 3ASA3.



[Figure 9, continued: equating plot and residual plot. Legend: SS difference, SS line, criterion line.]

Figure 9 (continued).



[Figure 10 plot: "SAT Mathematical Pre-Equating Data: 3ASA3 Operational and Pretest Samples." One schematic per sample, in the order X316, X313, 3ASA3 OP, X233, X226, X232, X241, X243, X405, X234, X415, Z515, C2318, C2314, Z512, C1613, C1614, on an ability axis running from -2.5 to 1.5.]

Figure 10: Schematics of summary statistics of distributions of estimated
abilities for all pretest samples and one operational sample
for SAT mathematical form 3ASA3.



[Figure 11 plot. Panels: item discrimination (a); item difficulty (b); lower asymptote (c); ability. Horizontal axes: true values.]

Figure 11: Second simulation study: Parameter estimates for Sample 1 taking
form 3ASA3-5 vs. true parameter values.



[Figure 12 plot. Panels: item discrimination (a); item difficulty (b); lower asymptote (c).]

Figure 12. Second simulation study: Parameter estimates for Samples 2 and
3 taking combined form 3ASA3-5A+5B vs. true parameter values.



[Figure 13 plots. Equating plot (left) and residual plot (right) for three panels. Horizontal axes: CRIT RS and CRIT SS. Legend: SS difference, SS line, criterion line (linear).]

Figure 13. Equating and equating residuals for the second simulation study.
Form 3ASA3-5A+5B equated to 3ASA3-5 (top); form 3ASA3-5 equated
to true form 3ASA3 (middle); form 3ASA3-5A+5B equated to true form
3ASA3 (bottom).



[Figure 14 plot. Panels: item discrimination (a); item difficulty (b); lower asymptote (c). Horizontal axes: 3ASA3-5 estimated. Legend: items from other pretests; items from 3ASA3-5B.]

Figure 14. Second simulation study: A comparison of 3ASA3-5A+5B estimated
parameters vs. 3ASA3-5 estimated parameters.



[Figure 16 plot. Surviving panel labels: item discrimination (a); item difficulty (b).]

Figure 16: Third simulation study: Parameter estimates for Sample 1 taking
form 3ASA3-6 vs. true parameter values.



[Figure 17 plot. Panels: item discrimination (a); item difficulty (b); lower asymptote (c). Horizontal axes: true values.]

Figure 17. Third simulation study: Parameter estimates for Samples 2 and
3 taking combined form 3ASA3-6A+6B vs. true parameter values.



[Figure 18 plots. Equating plot (left) and residual plot (right) for three panels: 3ASA3-6A+6B equated to 3ASA3-6; 3ASA3-6 equated to 3ASA3 true; 3ASA3-6A+6B equated to 3ASA3 true. Horizontal axes: CRIT RS and CRIT SS. Legend: SS difference, SS line, criterion line (linear).]

Figure 18. Equating and equating residuals for the third simulation study.
Form 3ASA3-6A+6B equated to 3ASA3-6 (top); form 3ASA3-6 equated
to true form 3ASA3 (middle); form 3ASA3-6A+6B equated to true form
3ASA3 (bottom).



[Figure 19 plot. Panels: item discrimination (a); item difficulty (b); lower asymptote (c). Horizontal axes: 3ASA3-6 estimated.]

Figure 19. Third simulation study: A comparison of 3ASA3-6A+6B estimated
parameters vs. 3ASA3-6 estimated parameters.




AN APPENDIX ON SIMILARITIES
The main body of this paper concludes that the third
simulation study produces results most consistent with the results
for the real data. The third simulation introduced a shift in mean
true ability in addition to a particular kind of multidimensionality
in the data. This resulted in an overestimation of the item
difficulties and a distorted distribution of estimated abilities.
These two results were also seen in the real data. Neither of the
other two simulation studies produced these simultaneous results for
both item parameter estimates and ability estimates.

It is important to note, however, that the resemblance between the
third simulation study and the real-data results does not imply
that the same mechanism, i.e., a decreased mean ability and
multidimensionality, necessarily produced the real data.
Unfortunately, one can never know what mechanism actually
produced the real data. All that can be done is to study as
carefully as possible all characteristics of both the real data and
the simulated data, looking for further similarities.

In this spirit, the following analysis, suggested by Marilyn
Wingersky, was performed. The results provide further evidence for
the consistency of simulated and real results.



The Questions to Be Answered

1. Sample 3 from the third simulation study took two blocks
of items. On one block, the mean true ability was decreased when
compared with other samples. On the other block, mean true ability
was decreased and multidimensionality introduced. If abilities
were estimated separately from the two sets of items, how would the
ability estimates compare with each other?

2. A particular sample of real examinees in the data
collection design described in Eignor and Stocking (1986) took two
blocks of items. One set was 18 pretest items from pretest form
C1613. The other set was a block of 34 items from a section of an
operational form, CI. This block of 34 items was combined with
additional items not included in the Eignor and Stocking study to
produce reported SAT mathematical scores for the sample of
examinees. If abilities were estimated separately from the two
sets of items, how would the ability estimates compare with each
other?

3. Do the two sets of estimated abilities, one for the
simulated data and the second for the real data, resemble each
other? If so, then the plausibility of the conclusion that both
sets of data produced consistent results is strengthened.



The Method, Caution, and a Standard of Comparison
One mechanism for comparing ability estimates from different
sets of items is to examine how well these ability estimates are
fit by the estimated response functions for items included in the
ability estimate as well as items excluded from the ability
estimate. Item-ability regression plots provide a convenient
graphical method for making these comparisons. (See Kingston and
Dorans (1985) for a detailed explanation of these plots.) The
solid curves in the plots used here are the item response functions
computed using the estimated item parameters from LOGIST. Each of
the different distributions of estimated abilities is grouped
identically, and the observed proportions of examinees responding
correctly to the item within a particular ability group are plotted
with different symbols for each distribution of estimated ability.
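
The construction of an item-ability regression described above can be sketched in a few lines. This is only an illustrative sketch, not the procedure actually used with LOGIST: the function names `irf_3pl` and `observed_proportions`, the scaling constant `D = 1.7`, and the choice of group boundaries are all assumptions introduced here.

```python
import math

D = 1.7  # assumed scaling constant for the logistic model


def irf_3pl(theta, a, b, c):
    """Three-parameter logistic item response function (the solid curve)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def observed_proportions(abilities, responses, edges):
    """Group examinees by estimated ability and summarize one item.

    abilities: estimated ability for each examinee
    responses: 0/1 item score for each examinee
    edges:     ascending group boundaries; group g is [edges[g], edges[g+1])
    Returns (mean ability, proportion correct, group size) per non-empty group.
    """
    groups = [[] for _ in range(len(edges) - 1)]
    for theta, u in zip(abilities, responses):
        for g in range(len(edges) - 1):
            if edges[g] <= theta < edges[g + 1]:
                groups[g].append((theta, u))
                break
    points = []
    for members in groups:
        if members:
            n = len(members)
            mean_theta = sum(t for t, _ in members) / n
            prop_correct = sum(u for _, u in members) / n
            points.append((mean_theta, prop_correct, n))
    return points
```

Calling `observed_proportions` once per set of ability estimates, with the same `edges`, gives the two sets of plotted points; comparing each point with `irf_3pl` evaluated at the group's mean ability shows how well that set of estimates is fit by the estimated curve.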

For both sets of data, simulated and real, we examined two
ability estimates, each based on a single block of items.
Necessarily, then, when we examine the item-ability regression for
a single item, one ability estimate is based on this and other items
in the same block. The other ability estimate is derived from a
separate block of items that does not include the item under
consideration. Aside from sampling error, we expect on theoretical
grounds that the observed proportions for the ability estimates
based on separate blocks of items will differ in a systematic way.
In particular, we expect the rough curve formed by the proportions
observed for ability estimates that include this item to be steeper
than the corresponding rough curve based on ability estimates that
exclude this item.

This phenomenon occurs for exactly the same reason that the
conventional biserial correlation between item score and total test
score is higher when the total test score includes the item under
consideration. Lord (1980, p. 33 and p. 40) shows that there is an
approximate functional relationship between the biserial correlation
and the IRT discrimination parameter under certain restrictive
assumptions. If the assumptions are not met, the relationship
becomes cruder, but does not disappear.
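
As a rough numerical illustration of this point, the sketch below uses the commonly cited normal-ogive form of the relationship, rho = a / sqrt(1 + a^2), which holds under the restrictive assumptions of no guessing and normally distributed ability. The exact expressions on the cited pages are not reproduced in this report, so the formula and the function names here should be read as assumptions; `a` is on the normal-ogive metric.

```python
import math


def biserial_from_discrimination(a):
    """Item-ability biserial correlation implied by discrimination a
    (normal-ogive metric, no guessing, normal ability: assumed form)."""
    return a / math.sqrt(1.0 + a * a)


def discrimination_from_biserial(rho):
    """Inverse of the relationship above: a = rho / sqrt(1 - rho^2)."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return rho / math.sqrt(1.0 - rho * rho)


# A higher biserial corresponds to a larger discrimination, i.e. a steeper curve.
a_item_excluded = discrimination_from_biserial(0.50)
a_item_included = discrimination_from_biserial(0.60)
```

Because the mapping is strictly increasing, any inflation of the biserial that comes from including the item in the ability estimate translates into a steeper empirical curve, which is the systematic difference the text anticipates.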

The rough curves formed by the two sets of observed proportions
can be viewed as empirical item response functions. Since the
discrimination parameter is a function of the slope of an item
response function, we find that the slope is steeper for the
empirical curve based on estimated abilities that include the item
under consideration.

All of this implies that before we can compare our simulated
and real data we need some standard of what is seen when comparing
estimated abilities based on different blocks of items under ideal
conditions. To produce such a standard, a new set of artificial
SAT mathematical data was constructed in which each of 2500
simulees was administered two blocks of items. Each block contained
the same 60 items for a total of 120 items per person. The items
and simulees were calibrated using LOGIST. The items were then
split into the two sets of 60 items and abilities estimated
separately using the estimated item parameters for each set of 60
items. Item-ability regressions were then plotted for all items
with the two ability estimates plotted with different plotting
symbols.
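
The data-generation step of this standard can be sketched as follows. The sketch covers only the simulation of responses, not the LOGIST calibration; the item parameters and ability distribution below are hypothetical placeholders (the study used true parameters defined from form 3ASA3), and `D = 1.7`, `p_correct`, and `simulate_two_blocks` are names assumed for illustration.

```python
import math
import random

D = 1.7  # assumed scaling constant for the logistic model


def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def simulate_two_blocks(thetas, items, rng):
    """Each simulee answers the same item set twice (block 1, then block 2),
    with responses drawn independently from the unidimensional 3PL model."""
    data = []
    for theta in thetas:
        row = [1 if rng.random() < p_correct(theta, a, b, c) else 0
               for _ in range(2)          # two administrations of the block
               for (a, b, c) in items]    # same items in each block
        data.append(row)
    return data


rng = random.Random(0)
# Hypothetical (a, b, c) values for 60 items and abilities for 2500 simulees.
items = [(rng.uniform(0.5, 1.5), rng.gauss(0.0, 1.0), rng.uniform(0.0, 0.3))
         for _ in range(60)]
thetas = [rng.gauss(0.0, 1.0) for _ in range(2500)]

responses = simulate_two_blocks(thetas, items, rng)
block1 = [row[:60] for row in responses]   # items 1-60
block2 = [row[60:] for row in responses]   # items 61-120
```

Abilities would then be estimated separately from `block1` and `block2`; because both blocks are generated from the same unidimensional model, any difference between the two sets of item-ability regressions reflects only the systematic effect described above plus sampling error.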

The results are shown for six items in Figures A-1 and A-2. A
'plus' symbol denotes observed proportions from groups of abilities
estimated from the first 60 items; a 'hexagon' is used for observed
proportions from abilities estimated from the second 60 items.
Items 3, 6, and [illegible] are in the first block of 60; the empirical curve
formed by the abilities estimated without these items (hexagons) is
less steep than that formed by the abilities estimated from items
included in this block (pluses). Items 63, 66, and 92 are in the
second block of items. Here the empirical curve formed by the
abilities estimated from the first block of items (pluses) is less
steep. These six item-ability regressions represent the most
noticeable differences between ability estimates based on identical
nonoverlapping blocks of items out of the 120 items. They can be
used as a standard against which to compare subsequent results.

Insert Figures A-1 and A-2 about here



Results for Simulated Data
Sample 3 in the third simulation was administered 24 items as
equating section fn and 30 items as 'pretest' section 3ASA3-6B.
Simulees responded to the 24-item block with mean true ability
decreased when compared to the other two samples in the third
simulation. A particular kind of multidimensionality was introduced
into the responses to the 30-item 'pretest' section. Using item
parameter estimates from the LOGIST calibration performed in the
third simulation, abilities were separately estimated for these two
nonoverlapping blocks of items. Item-ability regressions for three
items from the 24-item block are shown in Figure A-3 and for three
items from the 'pretest' block in Figure A-4. These particular
items were chosen because they show the most discrepancy between
the ability estimates.

A 'plus' is used to plot observed proportions from grouped
abilities estimated from the 24-item block; a 'hexagon' is used for
proportions based on grouped abilities estimated from the 30-item
'pretest' block. In Figure A-3, the observed proportions from
grouped abilities estimated from the 24-item block (pluses) are
reasonably well fit by the estimated item response function.
However, the observed proportions from grouped abilities estimated
from 'pretest' items (hexagons) are less well fit. A comparison
with the standards shown in Figures A-1 and A-2 indicates that this
lack of fit is larger than would be expected on the basis of the
expected systematic variation alone. The reverse phenomenon is
observed for the three 'pretest' items in Figure A-4. Here too the
discrepancies are larger than the systematic variation shown in the
standards of Figures A-1 and A-2. In addition, the observed
proportions based on ability estimates from the 24-item block
(pluses) are better fit by the 'pretest' item response functions
(Figure A-4) than the observed proportions based on ability
estimates from the 'pretest' items (hexagons) are fit by the item
response functions in the 24-item block (Figure A-3). That is, the
pluses fall closer to the curves in Figure A-4 than the hexagons
do in Figure A-3. This is because of the multidimensionality
introduced in responses to 'pretest' items.

Insert Figures A-3 and A-4 about here
The conclusions to be drawn from these item-ability regressions
are:

1. The abilities estimated from two different sets of items
are, in fact, different. This reflects the deliberate
modeling of a particular kind of multidimensionality.

2. Given the magnitude of the multidimensionality actually
modeled, it is surprising that the abilities estimated
from the two different sets of items are as similar as they
are.

3. Observed proportions based on abilities estimated from
items for which responses were simulated to fit the
unidimensional model are fit as well or better by the
estimated item response functions than are the observed
proportions based on abilities estimated from items for
which multidimensionality was introduced.

Results for Real Data
A particular sample of people who were included in a large
LOGIST calibration described in Eignor and Stocking (1986) also took
two nonoverlapping blocks of items. Eighteen items were pretest
items from form C1613; 34 items were items that contributed to the
final SAT mathematical score for this sample of examinees and that
were included in the large LOGIST calibration. These latter items
will be referred to as 'operational.' Not included in the LOGIST
calibration were the remaining 25 items that contributed to the
final SAT mathematical score for this sample.

Using item parameter estimates from this large LOGIST
calibration, abilities were separately estimated for the two
nonoverlapping blocks of items. Item-ability regressions for
three 'operational' items are shown in Figure A-5 and for three
pretest items in Figure A-6. These particular items were chosen
because they show the most discrepancy between ability estimates.

Insert Figures A-5 and A-6 about here



A 'plus' symbol indicates that ability was estimated from the
operational items; a 'hexagon' indicates that ability was estimated
from pretest items. In Figure A-5 the observed proportions from
grouped abilities estimated from operational items (pluses) are
reasonably well fit by the estimated item response function.
However, the observed proportions from abilities estimated from the
pretest items (hexagons) are less well fit. A comparison with the
standards shown in Figures A-1 and A-2 indicates that this lack of
fit is larger than would be expected. The reverse phenomenon is
observed for the three pretest items in Figure A-6. In addition,
the observed proportions based on ability estimates from the
operational items (pluses) are fit as well or better by all item
response functions shown in Figures A-5 and A-6 than the observed
proportions based on abilities estimated from pretest items.

The conclusions that can be drawn from these item-ability
regressions are:

1. The abilities estimated from the pretest and operational
items are, in fact, different.

2. The nature and magnitude of the differences between the
two sets of ability estimates are roughly the same as those
seen in the simulated data. This is most easily seen by
comparing Figure A-3 with Figure A-5, and Figure A-4 with
Figure A-6.




Conclusions from This Investigation 
It is clear that the behavior of real examinees studied here 
was different on pretest items from their behavior on the 
operational items. The consequences of this different behavior 
produce results consistent with results for simulated data where a 
mean shift and a particular type of multidimensionality were 
introduced. It is also clear that the nature and magnitude of the 
differences are similar. 

The results of this analysis do not prove that the underlying 
causative mechanisms are the same for both the real and the 
simulated data. Indeed, there is no analysis that can be performed 
that will do so. The results do, however, strengthen the 
assertion that the real and simulated results are consistent with 
each other. 



Figure A-1. Item-ability regressions for three items from the first block of 
60 items for simulated data where each simulee responded to 
120 items. A 'plus' symbolizes observed proportions for 
abilities estimated on items 1-60; a 'hexagon' symbolizes 
observed proportions for abilities estimated on items 61-120. 







Figure A-2. Item-ability regressions for three items from the second block of 
60 items for simulated data where each simulee responded to 
120 items. A 'plus' symbolizes observed proportions for 
abilities estimated on items 1-60; a 'hexagon' symbolizes 
observed proportions for abilities estimated on items 61-120. 








Figure A-3. Item-ability regressions for three items for Sample 3 from third 
simulation. Responses to these items were simulated with a 
unidimensional model. A 'plus' symbolizes observed proportions for 
abilities estimated on the block of items for which responses were 
unidimensional; a 'hexagon' symbolizes observed proportions for 
abilities estimated on the block of items for which 
multidimensionality was introduced. 








Figure A-4. Item-ability regressions for three items for Sample 3 from third 
simulation. Responses to these items were simulated with a 
particular unidimensional model. A 'plus' symbolizes observed 
proportions for abilities estimated on the block of items for which 
responses were unidimensional; a 'hexagon' symbolizes observed 
proportions for abilities estimated on the block of items for which 
multidimensionality was introduced. 








Figure A-5. Item-ability regressions for real data, sample 10 from Eignor and 
Stocking (1986). These three items are from the block of 
operational items. A 'plus' symbolizes observed proportions for 
abilities estimated on operational items; a 'hexagon' symbolizes 
observed proportions for abilities estimated on pretest items. 








Figure A-6. Item-ability regressions for real data, sample 10 from Eignor and 
Stocking (1986). These three items are from the block of pretest 
items. A 'plus' symbolizes observed proportions for abilities 
estimated on operational items; a 'hexagon' symbolizes observed 
proportions for abilities estimated on pretest items. 






